1. Introduction

Heart disease remains the leading cause of death globally, accounting for millions of deaths annually. The ability to accurately predict heart disease can lead to early diagnosis and intervention, significantly reducing the mortality rate associated with this condition. This project aims to leverage data science and machine learning techniques to analyze and predict the likelihood of heart attacks based on a wide range of factors including age, sex, chest pain type, resting blood pressure, serum cholesterol, and more.

Utilizing a dataset from Kaggle, this analysis will explore various features that may influence heart disease, implement several machine learning models to predict heart disease occurrence, and evaluate the performance of these models. The dataset includes diverse variables, offering a comprehensive insight into the factors that might contribute to heart disease.

Through this project, we aim to uncover significant predictors of heart disease, assess the predictive power of machine learning models in a medical context, and ultimately, contribute to the efforts of preventing heart disease by enabling early detection. This endeavor is not only a technical challenge but also a crucial step towards saving lives and improving health outcomes.

As we delve into the data and models, our goal is to present a clear, concise, and informative analysis that can serve as a foundation for further research and practical applications in the field of medical science and public health.

1.1 Examining the Project Topic

What is a Heart Attack?

  • The medical name for a heart attack is “myocardial infarction”.
  • In short, a heart attack is the occlusion of a blood vessel by plaque-like lesions filled with cholesterol and fat.
  • A lesion is an abnormal change in the tissue of an organ affected by disease.
  • As a result of the blockage, blood flow is completely cut off and a heart attack, which can lead to death, occurs.

How does a heart attack occur?

  • The heart is like an efficient pump, beating about 60 to 80 times a minute at rest to circulate blood throughout the body. Just like any other part of the body, the heart itself needs a steady supply of nutrients and oxygen, which it gets through its own set of blood vessels, known as coronary arteries.

  • Sometimes, these crucial arteries can have trouble with their blood flow due to blockages or narrowing, a condition known as coronary insufficiency. This issue can vary widely – it all depends on where the blockage is, how severe it is, and the specific arteries affected.

  • For some, this might just mean chest pain that pops up during a workout or some physical task, but goes away once they take a break. However, in more serious cases, if a coronary artery gets suddenly completely blocked, it can trigger a heart attack. This often starts with intense chest pain and, in the worst-case scenario, can lead to sudden death.

1.2 Recognizing Variables In Dataset

Variable definitions in the Dataset

  • Age: Age of the patient
  • Sex: Sex of the patient
  • exang: exercise induced angina (1 = yes; 0 = no)
  • ca: number of major vessels (0-3)
  • cp: chest pain type
    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  • trtbps: resting blood pressure (in mm Hg)
  • chol: cholesterol in mg/dl fetched via BMI sensor
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • rest_ecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • target (named output in the loaded data): 0 = less chance of heart attack; 1 = more chance of heart attack

Additional variable descriptions to help us

  1. age - age in years

  2. sex - sex (1 = male; 0 = female)

  3. cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 0 = asymptomatic)

  4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)

  5. chol - serum cholesterol in mg/dl

  6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

  7. restecg - resting electrocardiographic results (1 = normal; 2 = having ST-T wave abnormality; 0 = hypertrophy)

  8. thalach - maximum heart rate achieved

  9. exang - exercise induced angina (1 = yes; 0 = no)

  10. oldpeak - ST depression induced by exercise relative to rest

  11. slope - the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)

  12. ca - number of major vessels (0-3) colored by fluoroscopy

  13. thal - 2 = normal; 1 = fixed defect; 3 = reversible defect

  14. num - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)
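
The numeric codes above can be turned into readable labels before plotting. The following is a minimal sketch in base R, assuming the codings listed for sex and exercise-induced angina; the data frame `toy` and the `*_label` column names are made up for illustration and are not part of the analysis.

```r
# Made-up codes for illustration; on the real data the same calls
# would be applied to the corresponding columns of the loaded dataset
toy <- data.frame(sex = c(1, 0, 1, 1), exng = c(0, 0, 1, 0))

# sex: 1 = male, 0 = female
toy$sex_label <- factor(toy$sex, levels = c(0, 1), labels = c("female", "male"))

# exercise-induced angina: 1 = yes, 0 = no
toy$exng_label <- factor(toy$exng, levels = c(0, 1), labels = c("no", "yes"))

print(toy)
```

Keeping the original numeric columns next to the labelled ones leaves the data usable for functions that expect numeric input.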

2. First Organization

2.1 Required R Libraries

2.1.1 Basic Libraries

library(corrplot)    # for the correlation plot
library(discrim)     # discriminant analysis models for tidymodels
library(corrr)       # for calculating correlations
library(knitr)       # for knitting the report
library(MASS)        # statistical functions and datasets
library(tidyverse)   # data import, wrangling and plotting (includes ggplot2 and dplyr)
library(tidymodels)  # modelling framework (includes yardstick)
library(ggplot2)     # for most of our visualizations
library(ggrepel)     # non-overlapping text labels for ggplot2
library(rpart.plot)  # for visualizing trees
library(vip)         # for variable importance
library(janitor)     # for cleaning the data
library(ranger)      # for fitting random forests
library(dplyr)       # for data manipulation
library(yardstick)   # for model performance metrics
library(naniar)      # for visualizing missing data (vis_miss)
tidymodels_prefer()  # resolve function name conflicts in favour of tidymodels packages

2.2 Loading The Dataset

Heart_data <- read_csv("data/heart.csv", show_col_types = FALSE)
Heart_data
## # A tibble: 303 × 14
##      age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
##  1    63     1     3    145   233     1       0      150     0     2.3     0
##  2    37     1     2    130   250     0       1      187     0     3.5     0
##  3    41     0     1    130   204     0       0      172     0     1.4     2
##  4    56     1     1    120   236     0       1      178     0     0.8     2
##  5    57     0     0    120   354     0       1      163     1     0.6     2
##  6    57     1     0    140   192     0       1      148     0     0.4     1
##  7    56     0     1    140   294     0       0      153     0     1.3     1
##  8    44     1     1    120   263     0       1      173     0     0       2
##  9    52     1     2    172   199     1       1      162     0     0.5     2
## 10    57     1     2    150   168     0       1      174     0     1.6     2
## # ℹ 293 more rows
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>

3. Preparation for Exploratory Data Analysis(EDA)

3.1 Examining Missing Values

vis_miss(Heart_data)

This plot shows us at a glance that there is no missing data in the whole dataset.
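
The visual check can be backed by a numeric one. The following is a minimal sketch in base R on a small made-up data frame (`toy` is an illustration, not the heart dataset); on the real data the same calls would be applied to `Heart_data`.

```r
# Toy data frame standing in for Heart_data (values are made up)
toy <- data.frame(
  age  = c(63, 37, NA, 56),
  chol = c(233, 250, 204, 236)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# TRUE only when no column contains a missing value
no_missing <- all(missing_per_column == 0)
print(no_missing)
```

A check of this kind only finds values stored as `NA`; placeholder codes that stand for missing data are not detected.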

3.2 Examining Unique Values

unique_number <- sapply(Heart_data, function(x) length(unique(x)))
unique_values_df <- data.frame("Total Unique Values" = unique_number)
rownames(unique_values_df) <- names(Heart_data)
print(unique_values_df)
##          Total.Unique.Values
## age                       41
## sex                        2
## cp                         4
## trtbps                    49
## chol                     152
## fbs                        2
## restecg                    3
## thalachh                  91
## exng                       2
## oldpeak                   40
## slp                        3
## caa                        5
## thall                      4
## output                     2

Analysis Outputs(1)

  • Based on the table of unique values, we treat variables with few unique values as categorical and variables with many unique values as numeric.
  • Numeric variables: “age”, “trtbps”, “chol”, “thalachh” and “oldpeak”
  • Categorical variables: “sex”, “cp”, “fbs”, “restecg”, “exng”, “slp”, “caa”, “thall”, “output”
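
The split by number of unique values can also be expressed as a rule. The following is a minimal sketch in base R, assuming a cut-off `max_levels` that is hypothetical and not stated in the analysis; in the table above the largest categorical count is 5 and the smallest numeric count is 40, so any cut-off from 5 to 39 gives the same split. The data frame `toy` is made up for illustration.

```r
# Toy data frame (made-up values); on the real data use Heart_data
toy <- data.frame(
  age = c(63, 37, 41, 56, 57, 44),
  sex = c(1, 1, 0, 1, 0, 1),
  cp  = c(3, 2, 1, 1, 0, 1)
)

# Hypothetical cut-off: columns with at most this many distinct values
# are treated as categorical (an assumption, not taken from the report)
max_levels <- 5

# Distinct values per column, then split the column names by the cut-off
n_unique <- sapply(toy, function(x) length(unique(x)))
categoric_cols <- names(n_unique)[n_unique <= max_levels]
numeric_cols   <- names(n_unique)[n_unique > max_levels]

print(categoric_cols)
print(numeric_cols)
```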

3.3 Separating variables (Numeric or Categorical)

numeric_var <- c("age", "trtbps", "chol", "thalachh", "oldpeak")

categoric_var <- c("sex", "cp", "fbs", "restecg", "exng", "slp", "caa", "thall", "output")

Heart_data %>%
  select(all_of(numeric_var)) %>%
  summary()
##       age            trtbps           chol          thalachh        oldpeak    
##  Min.   :29.00   Min.   : 94.0   Min.   :126.0   Min.   : 71.0   Min.   :0.00  
##  1st Qu.:47.50   1st Qu.:120.0   1st Qu.:211.0   1st Qu.:133.5   1st Qu.:0.00  
##  Median :55.00   Median :130.0   Median :240.0   Median :153.0   Median :0.80  
##  Mean   :54.37   Mean   :131.6   Mean   :246.3   Mean   :149.6   Mean   :1.04  
##  3rd Qu.:61.00   3rd Qu.:140.0   3rd Qu.:274.5   3rd Qu.:166.0   3rd Qu.:1.60  
##  Max.   :77.00   Max.   :200.0   Max.   :564.0   Max.   :202.0   Max.   :6.20

4. Exploratory Data Analysis(EDA)

4.1 Variable Analysis

4.1.1 Numerical Variables(Analysis with Histograms)

ggplot(Heart_data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  ggtitle("Age Distribution") +
  xlab("Age") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = trtbps)) +
  geom_histogram(binwidth = 5, fill = "red", color = "black") +
  ggtitle("Resting Blood Pressure (mm Hg) Distribution") +
  xlab("Resting blood pressure (mm Hg)") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = chol)) +
  geom_histogram(binwidth = 5, fill = "purple", color = "black") +
  ggtitle("Cholesterol Distribution") +
  xlab("Cholesterol (mg/dl)") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = thalachh)) +
  geom_histogram(binwidth = 5, fill = "green", color = "black") +
  ggtitle("Maximum Heart Rate Achieved Distribution") +
  xlab("Maximum heart rate achieved") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = oldpeak)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  ggtitle("ST Depression Induced by Exercise Relative to Rest Distribution") +
  xlab("ST depression induced by exercise relative to rest") +
  ylab("Frequency")

Analysis Outputs(1)

Age Variable
  • The vast majority of patients are between 50 and 60 years old.
  • One feature of the chart stands out: there is a dip in the number of patients between the ages of 47 and 50.
  • There appear to be no outliers in this variable.
Trtbps Variable
  • The resting blood pressure of most patients is between 110 and 140.
  • Values above 180 can be considered outliers.
  • Patients are heavily concentrated in the 115-120, 125-130, and 155-160 ranges.
Cholesterol Variable
  • The cholesterol value of most patients is between 200 and 280.
  • Values above 380 can be considered outliers.
Thalachh Variable
  • The maximum heart rate achieved by most patients is between 145 and 170.
  • Values below 80, in particular, can be considered outliers.
Oldpeak Variable
  • For the vast majority of patients, the value ranges from 0 to 1.5.
  • Values above 2.5, in particular, can be considered outliers.
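
The outlier thresholds above are read off the histograms by eye. A common rule-based alternative is the 1.5 × IQR rule; the sketch below shows that rule in base R on made-up values. It is a different technique from the visual judgement used in this analysis, and it need not give the same cut-offs.

```r
# Made-up resting blood pressure values, for illustration only
x <- c(94, 110, 120, 120, 130, 130, 140, 140, 150, 200)

# First and third quartiles and the interquartile range
q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences of the 1.5 * IQR rule
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]
print(outliers)
```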

4.1.2 Categorical Variables(Analysis with Pie Chart)

categoric_var
## [1] "sex"     "cp"      "fbs"     "restecg" "exng"    "slp"     "caa"    
## [8] "thall"   "output"
heart_data_summary_sex <- Heart_data %>%
  count(sex) %>%
  mutate(perc = n / sum(n))

ggplot(heart_data_summary_sex, aes(x = "", y = n, fill = factor(sex))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("#FF5733", "#33B5FF")) +
  labs(title = "sex (Gender)", fill = "sex") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))   

heart_data_summary_cp <- Heart_data %>%
  count(cp) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#33B5FF", "#CDDC39", "#9C27B0")

ggplot(heart_data_summary_cp, aes(x = "", y = n, fill = factor(cp))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "cp (Chest Pain type )", fill = "cp") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_fbs <- Heart_data %>%
  count(fbs) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#33B5FF")



ggplot(heart_data_summary_fbs, aes(x = "", y = n, fill = factor(fbs))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "fbs (fasting blood sugar )", fill = "fbs") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_restecg <- Heart_data %>%
  count(restecg) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#33B5FF", "#CDDC39")



ggplot(heart_data_summary_restecg, aes(x = "", y = n, fill = factor(restecg))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "restecg (resting electrocardiographic results)", fill = "restecg") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_exng <- Heart_data %>%
  count(exng) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#CDDC39")



ggplot(heart_data_summary_exng, aes(x = "", y = n, fill = factor(exng))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "exng (exercise induced angina)", fill = "exng") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_slp <- Heart_data %>%
  count(slp) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#CDDC39", "#9C27B0")



ggplot(heart_data_summary_slp, aes(x = "", y = n, fill = factor(slp))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "slp (the slope of the peak exercise ST segment)", fill = "slp") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_caa <- Heart_data %>%
  count(caa) %>%
  mutate(perc = n / sum(n))


ggplot(heart_data_summary_caa, aes(x = "", y = n, fill = factor(caa))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  labs(title = "caa (number of major vessels)", fill = "caa") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_thall <- Heart_data %>%
  count(thall) %>%
  mutate(perc = n / sum(n))


ggplot(heart_data_summary_thall, aes(x = "", y = n, fill = factor(thall))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  labs(title = "thall (Thallium stress test)", fill = "thall") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_output <- Heart_data %>%
  count(output) %>%
  mutate(perc = n / sum(n))


ggplot(heart_data_summary_output, aes(x = "", y = n, fill = factor(output))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  labs(title = "output (diagnosis of heart disease)", fill = "output") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

4.1.2.1 Analysis Outputs(2)

Sex Variable

  • 68% of the patients are male, 32% are female.
  • So, the number of male patients is more than twice that of female patients.

Cp Variable

  • Almost half of the patients have an observation value of 0, which means they are asymptomatic: they report no chest pain.
  • Looking at the other half of the pie chart, more than 1 in 4 patients has an observation value of 2.
  • In other words, atypical angina is present in 29% of the patients.
  • This observation value covers patients with shortness of breath or non-classical pain.
  • The other two observation values are less common.
  • 16.5% of patients have a value of 1, which is typical angina. Typical angina is the classic exertional pain that comes on during physical activity.
  • The remaining 8% have a value of 3, which is non-anginal pain.
  • Non-anginal pain is the term used to describe chest pain that is not caused by heart disease or a heart attack.

Fbs Variable

  • The vast majority of patients, 85%, have an observation value of 0.
  • The fasting blood sugar of these patients is not above 120 mg/dl.
  • The remaining 15 percent have a fasting blood sugar above 120 mg/dl.

Restecg Variable

  • What stands out in the chart for this variable is that the number of patients with an observation value of 2 is negligible.
  • They make up 1.3 percent of all patients, which is not a meaningful share.
  • Under the additional variable descriptions, this value represents an ST-T wave abnormality.
  • The other point that stands out is that the numbers of patients with observation values of 1 and 0 are almost equal.
  • Patients with a value of 1, the blue slice of the chart, make up 50.2%.
  • Under the same descriptions, this means that the resting electrocardiographic results of these patients are normal.
  • The percentage of patients with a value of 0 is 48.5%.
  • Under the same descriptions, a value of 0 indicates hypertrophy. The first variable list codes this variable differently (0 = normal, 1 = ST-T wave abnormality, 2 = hypertrophy), so these labels depend on which coding the file actually uses.

Exng Variable

  • As noted earlier, this variable stands for exercise-induced angina.
  • Angina is chest pain caused by reduced blood flow through the coronary arteries that feed the heart.
  • The “exng” variable takes the value 1 if this pain occurs with exercise and 0 if it does not.
  • Here, there are more than twice as many 0 values as 1 values. More than two thirds of the patients do not have exercise-induced angina.

Slp Variable

  • The least common observation value is 0, at 7 percent.
  • These are patients with a downsloping ST segment.
  • The other two observation values are almost equally common.
  • Of the remaining patients, about half have a value of 1, a flat ST segment, and the other half have a value of 2, an upsloping ST segment.

Caa variable

  • This variable is the number of major vessels colored by fluoroscopy.
  • In more than half of the patients, 57.8 percent, the value is 0. That is, no major vessels were colored by fluoroscopy.
  • After 0, the value with the largest slice of the pie chart is 1.
  • In 21.5% of the patients, the number of major vessels observed is 1.
  • This suggests that the majority of patients have an occlusion in their vessels, so that the major vessels cannot be observed with fluoroscopy.
  • The variable description gives a range of 0-3, while the data contains five distinct values for “caa”.

Thall Variable

  • The “thall” variable is short for the “Thallium stress test.”
  • The thallium stress test is an imaging method that evaluates the amount of blood reaching the heart muscle and helps determine whether a person has coronary artery disease.
  • The description of this variable lists three observation values. However, the pie chart shows four: 0, 1, 2 and 3.
  • According to our research, an observation value of 0 denotes a missing value. Therefore, in the next step, the 0 values will be treated as missing and replaced with a sensible value.
  • According to the thallium stress test results, 54.8 percent of the patients have a value of 2, so their test result is normal.
  • 38.6 percent have a value of 3, which corresponds to a reversible defect.
  • 5.9 percent of patients have a value of 1, so the test result for these patients is a fixed defect.

Output Variable

  • More than half of the patients, 54.5 percent, are in the group with a higher chance of heart attack (output = 1). The remaining 45.5 percent are in the group with a lower chance.
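
The shares quoted in this section can be produced as a table instead of being read off the pie charts. The following is a minimal sketch in base R; `share_table` is a hypothetical helper name and `toy` is a made-up data frame, neither of which appears in the original analysis.

```r
# Count each level of a column and compute its share of all rows.
# `share_table` is a hypothetical helper, not part of the original report.
share_table <- function(data, column) {
  counts <- table(data[[column]])
  data.frame(
    value = names(counts),
    n     = as.integer(counts),
    perc  = as.numeric(counts) / sum(counts)
  )
}

# Made-up codes for illustration; on the real data pass Heart_data
toy <- data.frame(sex = c(1, 1, 0, 1, 0, 1, 1, 0))
sex_summary <- share_table(toy, "sex")
print(sex_summary)
```

A helper of this kind could replace the repeated count-and-mutate steps, with the column name passed as a string.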

4.1.2.2 Examining the Missing Data According to the Analysis Result

thal_zero_rows <- Heart_data %>% filter(thall == 0)
thal_zero_rows
## # A tibble: 2 × 14
##     age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
## 1    53     0     2    128   216     0       0      115     0       0     2
## 2    52     1     0    128   204     1       1      156     1       1     1
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
# Replace the 0 values with 2, the most common value of thall
Heart_data$thall[Heart_data$thall == 0] <- 2

unique_thal_categories <- unique(Heart_data$thall)
unique_thal_categories
## [1] 1 2 3
vis_miss(Heart_data)
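
The replacement value above is hard-coded. The following is a minimal sketch in base R that derives the most common valid value from the data instead; the vector `thall` here is made up for illustration, and the variable names `valid_counts` and `mode_value` are assumptions.

```r
# Made-up thall codes, where 0 stands for a missing value
thall <- c(2, 3, 0, 2, 1, 2, 3, 0)

# Most common valid code, computed only from the non-zero values
valid_counts <- table(thall[thall != 0])
mode_value   <- as.numeric(names(valid_counts)[which.max(valid_counts)])

# Replace the 0 codes with the most common valid code
thall[thall == 0] <- mode_value
print(thall)
```

If two codes were tied for most common, `which.max` would pick the first of them, so a tie would need an explicit decision.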

4.2.3 Numerical Variables - Categorical Variables (Analysis with Box Plot)

Heart_data_long <- Heart_data %>%
  pivot_longer(cols = c(trtbps, chol, thalachh), names_to = "variables", values_to = "value")

ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(sex))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "Sex") 

ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(cp))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "Cp") 

ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(output))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "output") 

Heart_data_long_2 <- Heart_data %>%
  pivot_longer(cols = age, names_to = "variables", values_to = "value")

ggplot(Heart_data_long_2, aes(x = variables, y = value, fill = factor(output))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "output") 

Heart_data_long_3 <- Heart_data %>%
  pivot_longer(cols = oldpeak, names_to = "variables", values_to = "value")

ggplot(Heart_data_long_3, aes(x = variables, y = value, fill = factor(output))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "output")

4.2.3.1 Analysis Outputs(3)

Sex - Numeric Variables
  • There is no strong correlation between “sex” and the numerical variables. The relationship is weak for all of them.
  • The boxes for the two values of the sex variable overlap heavily and are hard to tell apart.
Output - Numeric Variables
  • For the “oldpeak” variable, the median of the blue box falls outside the red box. This indicates a stronger relationship between “oldpeak” and the target than for the other numerical variables. We can say that there is a medium level of correlation.
  • There is also a correlation between “thalachh” and the output variable. Again, the median of the blue box falls outside the red box, which indicates a stronger relationship than for the others.
  • For the other three variables, there is not much correlation with the output.
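
The comparison of medians read off the box plots can also be computed directly. The following is a minimal sketch in base R on made-up values; `toy` is an illustration, and on the real data the same call would use the columns of `Heart_data`.

```r
# Made-up values for illustration only
toy <- data.frame(
  oldpeak = c(2.3, 0.0, 1.4, 0.2, 3.1, 0.6, 2.8, 0.0),
  output  = c(0,   1,   0,   1,   0,   1,   0,   1)
)

# Median of oldpeak within each output group
group_medians <- tapply(toy$oldpeak, toy$output, median)
print(group_medians)
```

A large gap between the group medians relative to the spread of the values corresponds to boxes that barely overlap in the plot.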

4.3 Relationships between variables

cor_mat <- cor(Heart_data)
corrplot(cor_mat, method = "color", 
         addCoef.col = "black",   
         tl.cex = 0.8,            
         number.cex = 0.6,        
         cl.cex = 0.8,            
         tl.col = "black",        
         tl.srt = 45,             
         order = "hclust" )            

4.3.1 Analysis Outputs(4)

Age Variable
  • The variable most strongly correlated with “age” is “thalachh”. The correlation between them is negative and moderate.
  • The correlation coefficient is -0.40. In other words, there is an inverse relationship between the “age” and “thalachh” variables.
Trtbps Variable
  • The variable most strongly correlated with “trtbps” is “age”. The correlation between them is 0.28.
  • This is a low positive correlation.
Chol Variable
  • The variable most strongly correlated with “chol” is “age”.
  • The correlation is 0.21, which is a low positive correlation.
  • So, cholesterol tends to increase slightly with age.
Thalachh Variable
  • The variable most strongly correlated with “thalachh” is “output”.
  • There is a moderate positive correlation of 0.42 between them. In other words, maximum heart rate is associated with the outcome, although correlation alone does not show that it triggers a heart attack.
  • This variable is correlated with several other variables.
  • This means that the maximum heart rate achieved may itself be influenced by other variables.
Oldpeak Variable
  • This variable has the strongest correlation in the whole table: -0.58 with the “slp” variable.
  • This is a negative correlation slightly above medium strength.
Sex Variable
  • There is no strong correlation between “sex” and the other variables.
Fbs Variable
  • The “fbs” variable generally does not correlate with other variables.
  • Its highest correlation, 0.18, is with “trtbps”. This is a low positive correlation.
Restecg Variable
  • There is no strong correlation between “restecg” and the other variables.
  • Its highest correlation, 0.14, is with “output”. This is a low positive correlation.
Output Variable
  • The “output” variable correlates with more than one variable.
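
The strongest correlate of each variable can be extracted from the correlation matrix rather than read off the plot. The following is a minimal sketch in base R on made-up data; `toy` and `strongest` are illustrative names, and on the real data the matrix would come from `cor(Heart_data)`.

```r
# Made-up data for illustration only
toy <- data.frame(
  age      = c(63, 37, 41, 56, 57, 44, 52, 60),
  thalachh = c(150, 187, 172, 160, 148, 173, 162, 140),
  trtbps   = c(145, 130, 130, 120, 140, 120, 172, 150)
)

cor_mat <- cor(toy)

# Ignore each variable's correlation with itself
diag(cor_mat) <- NA

# For every variable, the name of its strongest correlate by absolute value
strongest <- apply(abs(cor_mat), 1, function(r) names(r)[which.max(r)])
print(strongest)
```

Taking the absolute value before ranking treats strong negative and strong positive correlations alike, which matches how the list above describes them.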

5. Preparation for Modeling

5.1 Dropping Columns with Low Correlation

Heart_data <- Heart_data %>% select(-c(chol, fbs, restecg))
Heart_data
## # A tibble: 303 × 11
##      age   sex    cp trtbps thalachh  exng oldpeak   slp   caa thall output
##    <dbl> <dbl> <dbl>  <dbl>    <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1    63     1     3    145      150     0     2.3     0     0     1      1
##  2    37     1     2    130      187     0     3.5     0     0     2      1
##  3    41     0     1    130      172     0     1.4     2     0     2      1
##  4    56     1     1    120      178     0     0.8     2     0     2      1
##  5    57     0     0    120      163     1     0.6     2     0     2      1
##  6    57     1     0    140      148     0     0.4     1     0     1      1
##  7    56     0     1    140      153     0     1.3     1     0     2      1
##  8    44     1     1    120      173     0     0       2     0     3      1
##  9    52     1     2    172      162     0     0.5     2     0     3      1
## 10    57     1     2    150      174     0     1.6     2     0     2      1
## # ℹ 293 more rows
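
The columns above were chosen by reading the correlation plot. A rule-based version could look like the following minimal sketch in base R. The `threshold` value is hypothetical and is not claimed to reproduce the selection made in this analysis; `toy` is made-up data.

```r
# Made-up data for illustration only
toy <- data.frame(
  oldpeak = c(2.3, 0.0, 1.4, 0.2, 3.1, 0.6, 2.8, 0.0),
  fbs     = c(1,   1,   0,   0,   1,   0,   0,   1),
  output  = c(0,   1,   0,   1,   0,   1,   0,   1)
)

# Hypothetical cut-off for a "low" correlation (an assumption, not from the report)
threshold <- 0.15

# Correlation of every predictor with the target
predictors <- setdiff(names(toy), "output")
cor_with_y <- sapply(toy[predictors], function(x) cor(x, toy$output))

# Predictors whose absolute correlation with the target is below the cut-off
low_cor <- names(cor_with_y)[abs(cor_with_y) < threshold]

# Drop the weakly correlated predictors
reduced <- toy[, setdiff(names(toy), low_cor), drop = FALSE]
print(names(reduced))
```

A low linear correlation with the target does not rule out a useful non-linear or interaction effect, so a filter like this is only a first pass.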

5.2